Negative Example Aided Transcription Factor Binding Site Search

Chih Lee and Chun-Hsi Huang 

Abstract — Computational approaches to transcription factor binding site identification have been actively researched for the past 
decade. Negative examples have long been utilized in de novo motif discovery and have been shown useful in transcription factor 
binding site search as well. However, understanding of the roles of negative examples in binding site search is still very limited. 
We propose the 2-centroid and optimal discriminating vector methods, taking into account negative examples. Cross-validation results 
on E. coli transcription factors show that the proposed methods benefit from negative examples, outperforming the centroid and 
position-specific scoring matrix methods. We further show that our proposed methods perform better than a state-of-the-art method. 
We characterize the proposed methods in the context of the other compared methods and show that, coupled with motif subtype 
identification, the proposed methods can be effectively applied to a wide range of transcription factors. Finally, we argue that the 
proposed methods are well-suited for eukaryotic transcription factors as well. 
Software tools are available at: http://biogrid.engr.uconn.edu/tfbs_search/ 

Index Terms — transcription factor, sequence motif, sequence classification, negative example. 
1 Introduction

TRANSCRIPTION of genes followed by translation of 
their transcripts into proteins determines the type 
and functions of a cell. Expression of certain genes even 
initiates or suppresses differentiation of stem cells. It is 
therefore crucial to understand the mechanisms of tran- 
scriptional regulation. Among them, transcription factor 
(TF) binding is the one that has been given considerable 
attention by computational biologists for the past decade 
and is still being actively researched. A TF is a protein 
or protein complex that regulates transcription of one or 
more genes by binding to the double-stranded DNA. A 
first step in computational identification of target genes 
regulated by a TF is to pinpoint its binding sites in the 
genome. Once the binding sites are found, the putative 
target genes can be searched and located in flanking 
regions of the binding sites. 

In general, there are two approaches to computational
transcription factor binding site (TFBS) identification:
motif discovery and TFBS search. The former assumes
that a set of sequences is given and that each sequence
may or may not contain TFBS's. An algorithm then
predicts the locations and lengths of the TFBS's. The
term motif refers to the pattern that is shared by the
discovered TFBS's. Algorithms of this kind rely on no
prior knowledge of the motif and hence are known as
de novo motif discovery algorithms. The latter assumes
that, in addition to a set of sequences, the locations
and lengths of the TFBS's are known. An algorithm then
learns from these examples and predicts TFBS's in new
sequences. Such algorithms are also called supervised
learning algorithms since they are guided by the given
sequences with known TFBS's.

Considerable effort has been devoted to the de novo motif
discovery problem [1], [2], [3], [4], [5], [6], [7], [8], [9],
[10], [11]. Comprehensive evaluation and comparison of
the developed tools have been performed by Tompa et
al. [12] and Hu et al. [13]. In this study, we focus on
the problem of TFBS search. We refer readers interested
in the motif discovery problem to the evaluation and
review articles [12], [13], [14] and references therein.

A typical TFBS search method searches for the binding
sites of a particular transcription factor in the following
manner. It scans a target DNA sequence and compares
each l-mer to the binding site profile of the TF, where
l is the length of a binding site. Each l-mer is scored
against the profile. A cut-off score is then set by the
method to select candidate TF binding sites. The
position-specific scoring matrix is a widely used profile
representation, where the binding sites of a TF are
encoded as a 4 x l matrix. Column i of the matrix
stores the scores of matching the i-th letter in an l-mer to
nucleotides A, C, G and T, respectively. Depending on
the method of choice, the score of A at position i can
be the count of A at position i in the known TFBS's, the
log-transformed probability of observing A at position i,
or any other reasonable number.
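
The scanning procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the matrix stores log-transformed probabilities, a pseudocount of 1 is added to avoid log 0, and the function names, toy binding sites and cut-off value are hypothetical rather than taken from any of the cited tools.

```python
import math

ALPHABET = "ACGT"

def build_log_pssm(sites, pseudocount=1.0):
    """Build a 4 x l matrix of log-probabilities from aligned binding sites.

    The pseudocount is an assumption of this sketch, used to avoid log(0).
    """
    length = len(sites[0])
    matrix = []
    for i in range(length):
        counts = {u: pseudocount for u in ALPHABET}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({u: math.log(counts[u] / total) for u in ALPHABET})
    return matrix

def scan(sequence, matrix, cutoff):
    """Score every l-mer of `sequence`; keep (start, l-mer, score) at or above cutoff."""
    length = len(matrix)
    hits = []
    for start in range(len(sequence) - length + 1):
        lmer = sequence[start:start + length]
        score = sum(matrix[i][lmer[i]] for i in range(length))
        if score >= cutoff:
            hits.append((start, lmer, score))
    return hits

# Hypothetical toy binding sites and target sequence.
sites = ["TGACA", "TGACA", "TGATA", "TGACT"]
pssm = build_log_pssm(sites)
hits = scan("CCTGACAGG", pssm, cutoff=-4.0)
```

With these toy inputs only the window starting at offset 2 passes the cut-off; in practice the cut-off would be chosen per TF.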

Many novel methods were based on this simple
scoring method. Osada et al. [15] extended this scoring
approach by considering pairs of nucleotides and
weighting nucleotides and nucleotide pairs by information
content. Extensive leave-one-out (LOO) cross-validation
(CV) experiments were conducted on 35 TF's
with a total of 410 binding sites. The results showed
significant improvement regardless of the model used for
motif representation. In a recent study, Salama and Stekel
[20] showed correlations between two nucleotides within
a TFBS by plotting the mutual information matrix of a
motif, reinforcing the findings reported in [15]. A novel
scoring method called the ungapped likelihood under
positional background (ULPB) method was proposed in
that study. The ULPB method models a TFBS by two
first-order Markov chains and scores a candidate binding
site by the likelihood ratio produced by the two Markov
chains. LOO results on 22 TF's with 20 or more binding
sites showed that ULPB is superior to the methods
compared in their work.

Explicit use of negative examples in the TFBS search
problem is hindered by the vast number of non-binding
sites of a transcription factor. This is further aggravated
by the low specificity of some transcription factors,
where a binding site may be more similar to a
non-binding site than to some other binding sites. Due to these
issues, previous studies involving negative examples
are limited and the roles of negative examples remain
unclear. In a review article, Hannenhalli [17] surveyed
work on improved motif models and integrative methods.
None of the studies reviewed in [17], however,
investigated the use of negative examples on top of
true TFBS's. While introducing improved benchmarks
for computational motif discovery, Sandve et al. [16]
described algorithms for finding optimal motif models
using both positive and negative TFBS's. Three models
were compared using the proposed benchmarks. However,
no methods relying on only positive examples were
compared. Recently, Do and Wang [18] formulated the
TFBS search problem as a classification problem,
proposed a novel similarity measure, and investigated three
classification techniques. Five-fold CV results showed
that learning vector quantization performed better than
P-Match [19], which requires only positive examples.
The evaluation, however, was done on only 8 human
transcription factors and 8 artificial ones. It is not clear
how the results on the small set of 8 real TF's
generalize to other TF's.

The goal of this study is to investigate the inclusion
of negative examples in addition to positive ones in
TFBS search. We propose and characterize two novel
extensions of the centroid method introduced in [15].
Besides the sequence similarity measures employed in
[15], we also incorporate the novel similarity measure
in [18] into an extension of the centroid method. We
compare our proposed methods to methods that do
not rely upon negative examples, that is, the centroid
method, the ULPB method [20] and the well-known
position-specific scoring matrix method. Performance of
a method is assessed by LOO CV experiments on two
data sets of 35 and 26 transcription factors, respectively.
Moreover, we discuss the situations in which the proposed
methods can accurately differentiate binding sites from
non-binding sites. Advantages of coupling motif subtype
identification with the proposed methods are also
discussed.



TABLE 1

Statistics of the first data set with 35 TF's

Name          Length   # TFBS's     Name          Length   # TFBS's
Q             48       6            arcA          15       13
argR          18       17           R             15       12
crp           22       49           cspA          20       4
cytR          18       5            dnaA          15       8
fadR          17       7            fis           35       19
fnr           22       13           fruR          16       12
(illegible)   18       9            (illegible)   16       7
(illegible)   20       4            glpR          20       13
hipB          30       4            ihf           48       26
lexA          20       19           lrp           25       14
malT          10       10           metJ          16       15
metR          15       8            nagC          23       6
narL          16       10           ntrC          17       5
ompR          20       9            oxyR          39       4
phoB          22       15           purR          26       22
soxS          35       14           torR          10       4
trpR          24       4            tus           23       6
tyrR          22       17

TABLE 2

Statistics of the second data set with 26 TF's

Name    Length   # TFBS's     Name    Length   # TFBS's
MetJ    8        29           Lrp     12       62
SoxS    18       19           H-NS    15       37
FlhDC   16       20           AraC    18       20
Fis     15       206          ArcA    15       93
IHF     13       101          OmpR    20       22
PhoB    20       17           GlpR    20       23
OxyR    17       41           CpxR    15       37
NarL    7        90           CRP     22       249
TyrR    18       19           NarP    7        20
Fur     19       81           LexA    20       40
NtrC    17       17           FNR     14       87
MalT    10       20           PhoP    17       21
ArgR    18       32           NsrR    11       37

The paper is organized as follows. In Section 2, we
introduce existing methods compared in this study and
describe two novel methods proposed in this work.
Leave-one-out cross-validation results on two data sets
are presented in Section 3. In Section 4, properties of
the proposed methods are studied and discussed, and
connections between the proposed methods and the other
compared methods are established. Finally, we give
concluding remarks in Section 5.

2 Methods 

2.1 Data sets 

For ease of comparison, we conduct experiments on
two data sets used in previous work. The first set was
collected by Osada et al. [15] and consists of 410
binding sites of 35 TF's with flanking regions located in
the E. coli K-12 genome (version M54 of strain MG1655
[21]). The statistics of this data set are listed in Table 1.
The second one also contains binding sites of TF's in
the E. coli K-12 genome and was considered in [20]. We
downloaded the latest data (release 6.8) from RegulonDB
[22] and kept only 26 TF's with 17 or more known
binding sites. We summarize the data set in Table 2.

2.2 The centroid and 2-centroid methods

We introduce the centroid method proposed by Osada
et al. [15] in a different manner. We first define the
similarity measure between two sequences s and t of
length l:

$$\mathrm{Sim}(s, t) = \sum_{i=1}^{l} w_i \, \chi_{s_i}(t_i), \qquad (1)$$

where $s_i$ ($t_i$) is the i-th letter of s (t), $w_i$ denotes the weight
on the i-th letter and $\chi_{s_i}(\cdot)$ is the indicator function given
by

$$\chi_{s_i}(t_i) = \begin{cases} 1 & \text{if } t_i = s_i, \\ 0 & \text{otherwise.} \end{cases}$$

In this work, $w_i$ is set to either 1 or the information
content at position i defined as

$$IC_i = 2 + \sum_{u \in \{A, C, G, T\}} f_i(u) \log_2 [f_i(u)], \qquad (2)$$

where $f_i(u)$ is the probability of observing letter u at
position i. When $w_i = 1$ for all i, Sim(s, t) simply counts
the number of letters shared between s and t. When
pairs of nucleotides are taken into account, the similarity
measure is defined as follows:

$$\mathrm{Sim2}(s, t) = \mathrm{Sim}(s, t) + \sum_{k=1}^{K} \sum_{i=1}^{l-k} w_{i,j} \, \chi_{s_i s_j}(t_i t_j), \qquad (3)$$

where j = i + k and $\chi_{s_i s_j}(\cdot)$ is the indicator function
given by

$$\chi_{s_i s_j}(t_i t_j) = \begin{cases} 1 & \text{if } t_i = s_i \text{ and } t_j = s_j, \\ 0 & \text{otherwise.} \end{cases}$$

Similarly, $w_{i,j}$ is set to either 1 or the information content
of the nucleotide pair at positions i and j, given by

$$IC_{i,j} = 4 + \sum_{u, v \in \{A, C, G, T\}} f_{i,j}(u, v) \log_2 [f_{i,j}(u, v)], \qquad (4)$$

where $f_{i,j}(u, v)$ is the probability of observing letters u
and v at positions i and j, respectively. We consider
only pairs that are at most 2 nucleotides apart (K = 2)
according to the results reported in [15].
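
As a rough illustration of the similarity measures and the information content defined in this subsection, the following Python sketch assumes that $f_i(u)$ is estimated by the relative frequency of letter u at position i among the known sites and that 0 log 0 is treated as 0. The function names and the dictionary-based pair weights are hypothetical choices made for this sketch only.

```python
import math
from collections import Counter

def info_content(sites, i):
    """IC_i = 2 + sum_u f_i(u) log2 f_i(u); f_i is estimated by relative frequency."""
    counts = Counter(site[i] for site in sites)
    n = len(sites)
    return 2.0 + sum((c / n) * math.log2(c / n) for c in counts.values())

def sim(s, t, w=None):
    """Sim(s, t): weighted count of positions where s and t agree."""
    w = w or [1.0] * len(s)
    return sum(w[i] for i in range(len(s)) if s[i] == t[i])

def sim2(s, t, K=2, w=None, w_pair=None):
    """Sim2(s, t): Sim plus agreement on nucleotide pairs (i, i + k), k = 1..K.

    `w_pair` maps (i, j) to a pair weight; None means all pair weights are 1.
    """
    l = len(s)
    total = sim(s, t, w)
    for k in range(1, K + 1):
        for i in range(l - k):
            j = i + k
            if s[i] == t[i] and s[j] == t[j]:
                total += 1.0 if w_pair is None else w_pair[(i, j)]
    return total
```

For example, with unit weights, `sim("AGTG", "AGCG")` counts the three shared letters, and `sim2` adds one for each of the two matching pairs, at positions (1, 2) and (2, 4).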

To facilitate similarity computation, an l-mer s can be
easily embedded in $\mathbb{R}^{4l}$ such that the similarity
measure in (1) is preserved by the dot product between two vectors.
That is, letter $s_i$ is converted to 4 dummy variables,
$\sqrt{w_i}\chi_A(s_i)$, $\sqrt{w_i}\chi_C(s_i)$, $\sqrt{w_i}\chi_G(s_i)$ and $\sqrt{w_i}\chi_T(s_i)$, for i =
1, 2, ..., l. Fig. 1 illustrates the transformation of an l-mer
into a 4l-element vector when $w_i = 1$ for i = 1, 2, ..., l.
Similarly, an l-mer can be transformed into a (36l - 48)-element
vector such that the similarity measure in (3)
with K = 2 is preserved, where a pair of nucleotides
is converted to 16 dummy variables; the dimension is the
sum of 4l single-nucleotide entries and 16(l - 1) + 16(l - 2)
pair entries. Consequently, the
similarity between two sequences s and t can be computed
by $\mathbf{s}^T \mathbf{t}$, where $\mathbf{s}$ and $\mathbf{t}$ denote sequences s and
t, respectively, embedded in the Euclidean space. In the
rest of the paper, we denote a sequence s embedded in
the Euclidean space by the same symbol in bold, i.e., $\mathbf{s}$.



AGTGCTCT

1000 0010 0001 0010 0100 0001 0100 0001

Fig. 1. Illustration of embedding an l-mer in $\mathbb{R}^{4l}$ with $w_i = 1$
for i = 1, 2, ..., l.
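
The embedding described above can be checked numerically. The sketch below is an illustration only: it assumes the dummy variables are ordered A, C, G, T within each position, and the helper names `embed` and `dot` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def embed(s, w=None):
    """Map an l-mer to a 4l-element vector with entries sqrt(w_i) * indicator."""
    w = w or [1.0] * len(s)
    vec = []
    for i, letter in enumerate(s):
        vec.extend(math.sqrt(w[i]) if letter == u else 0.0 for u in ALPHABET)
    return vec

def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

# The dot product of two embedded l-mers reproduces the weighted match count.
unweighted = dot(embed("AGTG"), embed("AGCG"))
weights = [4.0, 1.0, 1.0, 1.0]            # hypothetical position weights
weighted = dot(embed("AGTG", weights), embed("AGCG", weights))
```

Because each position contributes $\sqrt{w_i} \cdot \sqrt{w_i} = w_i$ exactly when the two letters agree, the dot product equals Sim(s, t).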



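Consider a set S of $n_+$ binding sites of length l for a
TF. The centroid method scores an l-mer t by

$$\mathrm{Score}(t) = \frac{1}{n_+} \sum_{s \in S} \mathbf{s}^T \mathbf{t} = \Big( \frac{1}{n_+} \sum_{s \in S} \mathbf{s} \Big)^T \mathbf{t} = \boldsymbol{\mu}_+^T \mathbf{t}, \qquad (5)$$

where $\boldsymbol{\mu}_+ = \frac{1}{n_+} \sum_{s \in S} \mathbf{s}$ is the centroid of the binding
sites in S.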

Now, with a set N of $n_-$ non-binding sites of length
l for the TF, a natural extension of the centroid method
scores an l-mer t by

$$\mathrm{Score}(t) = \boldsymbol{\mu}_+^T \mathbf{t} - \frac{1}{n_-} \sum_{s \in N} \mathbf{s}^T \mathbf{t} = (\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-)^T \mathbf{t}, \qquad (6)$$

where $\boldsymbol{\mu}_- = \frac{1}{n_-} \sum_{s \in N} \mathbf{s}$ is the centroid of the
non-binding sites in N. We refer to this method as the
2-centroid method in the rest of the paper since it employs
the centroids of the binding sites and the non-binding
sites. Fig. 2 illustrates the centroid and 2-centroid methods
when non-TFBS's as well as TFBS's are available.
Alternatively, Score(t) in (6) can be interpreted as follows:
it measures the average similarity of t to all the
binding sites, measures the average similarity of t to all
the non-binding sites, and calculates the difference.
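
A minimal Python sketch of the two scoring functions follows, assuming unit weights ($w_i = 1$) and single-nucleotide features only. The toy binding and non-binding sites are hypothetical, as are the function names.

```python
ALPHABET = "ACGT"

def embed(s):
    """One-hot embedding of an l-mer in R^{4l} (all weights equal to 1)."""
    return [1.0 if letter == u else 0.0 for letter in s for u in ALPHABET]

def centroid(seqs):
    """Component-wise mean of the embedded sequences."""
    vectors = [embed(s) for s in seqs]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def score_centroid(t, mu_pos):
    """Centroid score: mu_+^T t."""
    return dot(mu_pos, embed(t))

def score_2centroid(t, mu_pos, mu_neg):
    """2-centroid score: (mu_+ - mu_-)^T t."""
    diff = [p - n for p, n in zip(mu_pos, mu_neg)]
    return dot(diff, embed(t))

# Hypothetical toy data.
binding = ["AAAA", "AAAT"]
non_binding = ["CCCC", "CCCA"]
mu_pos, mu_neg = centroid(binding), centroid(non_binding)
```

Here `score_centroid("AAAA", mu_pos)` is the average number of matches to the binding sites, and `score_2centroid` subtracts the average number of matches to the non-binding sites, in line with the interpretation given above.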

We note that Score(t) in (5) is proportional to
Score(t)/$\|\boldsymbol{\mu}_+\|$, where $\|\boldsymbol{\mu}_+\|$ is the length of $\boldsymbol{\mu}_+$.
Moreover, by virtue of the equality

$$\boldsymbol{\mu}_+^T \mathbf{t} = \|\boldsymbol{\mu}_+\| \, \|\mathbf{t}\| \cos\theta,$$

we know Score(t)/$\|\boldsymbol{\mu}_+\|$ equals the orthogonal projection
of $\mathbf{t}$ onto $\boldsymbol{\mu}_+$, where $\theta$ is the angle formed by vectors
$\boldsymbol{\mu}_+$ and $\mathbf{t}$ (see Fig. 3 for an illustration). The computation
of Score(t) is therefore equivalent to computation of the
orthogonal projection of $\mathbf{t}$ onto $\boldsymbol{\mu}_+$. Similarly, the
computation of Score(t) in (6) is equivalent to computation
of the orthogonal projection of $\mathbf{t}$ onto $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$.

2.3 Optimal scoring function 

It can be seen that the scoring functions in (5) and (6)
take the following form:

$$\mathrm{Score}(t) = \boldsymbol{\beta}^T \mathbf{t}, \qquad (7)$$

where $\boldsymbol{\beta} = \boldsymbol{\mu}_+$ for the centroid method and $\boldsymbol{\beta} = \boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$
for the 2-centroid method. Therefore, an "optimal" $\boldsymbol{\beta}$
gives rise to an optimal scoring function with the most
discriminating power.

Fig. 2. Illustration of the 2-centroid method. The solid
arrow denotes vector $\boldsymbol{\mu}_+$, while the dashed arrow represents
vector $\boldsymbol{\mu}_+ - \boldsymbol{\mu}_-$, pointing from $\boldsymbol{\mu}_-$ to $\boldsymbol{\mu}_+$.

Fig. 3. The orthogonal projection of $\mathbf{t}$ onto $\boldsymbol{\mu}_+$ is equal to
Score(t)/$\|\boldsymbol{\mu}_+\|$ $\propto$ Score(t).



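We describe a way of finding an optimal $\boldsymbol{\beta}$. Suppose
that $|S| = n_+$ and $|N| = n_-$, that is, there
are $n_+$ binding sites and $n_-$ non-binding sites for a
particular TF. Let $S = \{t_{(1)}, t_{(2)}, \ldots, t_{(n_+)}\}$ and $N =
\{t_{(n_+ + 1)}, t_{(n_+ + 2)}, \ldots, t_{(n)}\}$, where $t_{(i)}$ denotes the i-th
l-mer in $S \cup N$ and $n = n_+ + n_-$. We find the optimal $\boldsymbol{\beta}$
by solving the following minimization problem:

$$\min \; \frac{1}{2} \|\boldsymbol{\beta}\|^2 + \frac{C}{n_+} \sum_{i=1}^{n_+} \xi_i + \frac{C}{n_-} \sum_{i=n_+ + 1}^{n} \xi_i \qquad (8)$$

subject to

$$\frac{\mathrm{Score}(t_{(i)})}{\|\boldsymbol{\beta}\|} \geq \frac{b + 1 - \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } t_{(i)} \in S, \qquad (9)$$

$$\frac{\mathrm{Score}(t_{(i)})}{\|\boldsymbol{\beta}\|} \leq \frac{b - 1 + \xi_i}{\|\boldsymbol{\beta}\|} \quad \text{for } t_{(i)} \in N, \qquad (10)$$

$$\xi_i \geq 0, \quad \forall i. \qquad (11)$$

The constraint in (9) ensures that the projection of a TFBS
onto the vector $\boldsymbol{\beta}$, Score($t_{(i)}$)/$\|\boldsymbol{\beta}\|$, exceeds the threshold
$(b + 1)/\|\boldsymbol{\beta}\|$. On the other hand, the constraint in (10) ensures
that the projection of a non-TFBS onto $\boldsymbol{\beta}$ stays below
the threshold $(b - 1)/\|\boldsymbol{\beta}\|$. Flexibility is given to the thresholds
by introducing the $\xi_i$'s, with cost captured by the last two
terms in (8), where C is a positive parameter. Finally, to
clearly distinguish TFBS's from non-TFBS's, the squared
difference between the two thresholds $(b + 1)/\|\boldsymbol{\beta}\|$ and
$(b - 1)/\|\boldsymbol{\beta}\|$ is made as large as possible. This amounts to maximizing
$(2/\|\boldsymbol{\beta}\|)^2$ or, equivalently, minimizing $\frac{1}{2}\|\boldsymbol{\beta}\|^2$, which is
the first term in (8). We call this approach the optimal
discriminating vector (ODV) method.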
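
To make the optimization concrete, the sketch below minimizes an unconstrained hinge-loss form of a soft-margin objective of this kind by plain subgradient descent. This is a swapped-in solver chosen for brevity, not the procedure used in this work, which does not specify a solver here; the threshold symbol b, the per-class cost scaling, the learning rate and the number of epochs are all assumptions of this sketch.

```python
ALPHABET = "ACGT"

def embed(s):
    """One-hot embedding of an l-mer in R^{4l} (unit weights)."""
    return [1.0 if letter == u else 0.0 for letter in s for u in ALPHABET]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def train_odv(pos, neg, C=1.0, epochs=500, lr=0.01):
    """Subgradient descent on
         0.5 * ||beta||^2
         + (C / n_+) * sum over S of max(0, 1 - (beta.t - b))
         + (C / n_-) * sum over N of max(0, 1 + (beta.t - b)).
    `pos` and `neg` are lists of embedded vectors. Returns (beta, b).
    """
    dim = len(pos[0])
    beta = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad = list(beta)                  # gradient of 0.5 * ||beta||^2
        grad_b = 0.0
        for vectors, sign in ((pos, 1.0), (neg, -1.0)):
            cost = C / len(vectors)
            for t in vectors:
                margin = sign * (dot(beta, t) - b)
                if margin < 1.0:           # slack is active for this example
                    for k in range(dim):
                        grad[k] -= cost * sign * t[k]
                    grad_b += cost * sign
        beta = [bk - lr * gk for bk, gk in zip(beta, grad)]
        b -= lr * grad_b
    return beta, b

# Hypothetical toy data: three binding sites and three non-binding sites.
pos = [embed(s) for s in ["AAAA", "AAAT", "AAAC"]]
neg = [embed(s) for s in ["CCCC", "GGGG", "CCGG"]]
beta, b = train_odv(pos, neg)
```

On this separable toy example every binding site ends up with a higher value of beta.t than every non-binding site. A quadratic programming or SVM solver would be the more usual choice for real data.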

2.4 PSSM and ULPB 

We briefly describe the PSSM (position-specific scoring
matrix) methods used in [15], [20] and the ungapped
likelihood under positional background method
proposed by Salama and Stekel [20]. Consider a specific TF
with binding sites of length l. The PSSM method used
in [20] scores an l-mer t by



(12) 



where no pair of nucleotides was considered for this
model in [20]. We refer to this method as the
position-specific probability matrix (PSPM) method to distinguish
it from the PSSM used in [15].

The PSSM method given in [15] takes into account
background probabilities and scores an l-mer by

$$\sum_{i=1}^{l} w_i \log \frac{f_i(t_i)}{f(t_i)}, \qquad (13)$$

where f(u) is the background probability of observing nucleotide
$u \in \{A, C, G, T\}$. When nucleotide pairs are considered, the
score becomes

$$\sum_{i=1}^{l} w_i \log \frac{f_i(t_i)}{f(t_i)} + \sum_{k=1}^{K} \sum_{i=1}^{l-k} w_{i,j} \log \frac{f_{i,j}(t_i, t_j)}{f_k(t_i, t_j)}, \qquad (14)$$

where j = i + k, K = 2 and $f_k(u, v)$ is the background
probability of observing letters u and v separated by
k - 1 arbitrary letters in between. For this method, we
estimate the background probabilities using only the
TFBS sequences as in [15].

The ULPB method models a TFBS by a first-order Markov
chain and models the background by another first-order
Markov chain. The former depends on the position-specific
transition probability $f_i(v|u)$, which gives the probability
of observing v at the (i + 1)-th position given that u has
been seen at position i, where $u, v \in \{A, C, G, T\}$ and
i = 1, 2, ..., l - 1. The latter depends on the background
transition probability $f(v|u)$, the probability of observing
v given that u has been observed at the previous position,
where $u, v \in \{A, C, G, T\}$. For this method, the
background transition probabilities are estimated using the
entire genome of a species. The ULPB method scores an
l-mer by

$$\log f_1(t_1) + \sum_{i=1}^{l-1} \log \frac{f_i(t_{i+1} \mid t_i)}{f(t_{i+1} \mid t_i)}. \qquad (15)$$

Although Salama and Stekel [20] did not consider the
background probability in the first term of (15), the score
is approximately the log-likelihood ratio of the two
Markov chains.
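
The following Python sketch illustrates a score of the form in (15). It is not the implementation of Salama and Stekel: the pseudocount of 1 used when estimating the probabilities, the function names, the toy binding sites and the short stand-in "genome" are all assumptions made so that the example is self-contained.

```python
import math

ALPHABET = "ACGT"

def first_letter_probs(sites, pc=1.0):
    """Estimate f_1(u) from the first letters of the sites (with pseudocount)."""
    counts = {u: pc for u in ALPHABET}
    for s in sites:
        counts[s[0]] += 1
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}

def normalise(counts):
    """Turn a table of transition counts into conditional probabilities."""
    return {u: {v: counts[u][v] / sum(counts[u].values()) for v in ALPHABET}
            for u in ALPHABET}

def positional_transitions(sites, pc=1.0):
    """Estimate f_i(v|u) for each pair of adjacent positions (i, i + 1)."""
    trans = []
    for i in range(len(sites[0]) - 1):
        counts = {u: {v: pc for v in ALPHABET} for u in ALPHABET}
        for s in sites:
            counts[s[i]][s[i + 1]] += 1
        trans.append(normalise(counts))
    return trans

def background_transitions(genome, pc=1.0):
    """Estimate the position-independent background f(v|u) from a sequence."""
    counts = {u: {v: pc for v in ALPHABET} for u in ALPHABET}
    for u, v in zip(genome, genome[1:]):
        counts[u][v] += 1
    return normalise(counts)

def ulpb_score(t, f1, trans, bg):
    """log f_1(t_1) + sum_i log[ f_i(t_{i+1}|t_i) / f(t_{i+1}|t_i) ]."""
    score = math.log(f1[t[0]])
    for i in range(len(t) - 1):
        u, v = t[i], t[i + 1]
        score += math.log(trans[i][u][v] / bg[u][v])
    return score

# Hypothetical toy inputs; a real background would be an entire genome.
sites = ["ACGT", "ACGT", "ACGA", "ACGT"]
genome = "AACAGATCCGCTGGTTA"      # contains every dinucleotide exactly once
f1 = first_letter_probs(sites)
trans = positional_transitions(sites)
bg = background_transitions(genome)
```

With this uniform toy background, an l-mer resembling the sites, such as `ACGT`, scores higher than an unrelated one, such as `TTTT`.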

3 Results 

In this section, we show results of experiments
conducted on the two data sets introduced in Section 2.1.
Results on the first data set are presented in Section 3.1
through Section 3.3, while results on the second set are
summarized in Section 3.4.

3.1 Leave-one-out cross-validation 

We conducted LOO CV experiments on the first data set
introduced in the previous section. To allow comparison
of our results to those obtained by Osada et al. [15], we
closely followed the steps described in [15]. We briefly
describe the LOO CV procedure adopted in [15] since
only the TFBS's are left out in the process.

Consider a TF with $n_+$ TFBS's of length l with flanking
regions on both sides. A set of negative examples, $N_{test}$,
called the test negatives, is constructed from the TFBS's of
the other 34 TF's as in [15]. Another set of negative
examples, $N_{train}$, called the training negatives, is collected from
the sequences embedding the $n_+$ binding sites. It comprises
all the l-mers except for the TFBS's and the two neighboring
l-mers of each TFBS.

At each iteration of LOO CV, one of the $n_+$ TFBS's,
called the test TFBS, is left out. The rest of the TFBS's are
called the training TFBS's. A scoring function is
then obtained using the training TFBS's and 5% of the
non-TFBS's randomly sampled from the training negatives.
The test TFBS along with the non-TFBS's in $N_{test}$ are then
scored by the scoring function. To score a test sequence,
both the forward and reverse strands are scored and, in
case the test sequence is longer or shorter than l, the
l-mer producing the highest score is used. The rank of the
test TFBS is then recorded and the average rank over the
CV process is computed, where the rank of a TFBS t is
defined as $1 + |\{s \in N_{test} \mid \mathrm{Score}(s) > \mathrm{Score}(t)\}|$.
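
The ranking step can be sketched as follows. This is an illustration under assumptions: every test sequence is taken to be at least l letters long (for example, because flanking regions are included), and `score_fn` stands for any of the scoring functions discussed in Section 2. The helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse strand of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def best_score(seq, l, score_fn):
    """Highest score over all l-mers on both strands of `seq`.

    Assumes len(seq) >= l; otherwise no l-mer exists and -inf is returned.
    """
    best = float("-inf")
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - l + 1):
            best = max(best, score_fn(strand[start:start + l]))
    return best

def rank_of_test_site(test_site, test_negatives, l, score_fn):
    """1 + number of test negatives scoring strictly higher than the test TFBS."""
    target = best_score(test_site, l, score_fn)
    higher = sum(1 for s in test_negatives if best_score(s, l, score_fn) > target)
    return 1 + higher

# Hypothetical toy scoring function: number of letters shared with AAAA.
toy_score = lambda lmer: sum(1 for a, b in zip(lmer, "AAAA") if a == b)
rank = rank_of_test_site("AAAT", ["CCCC", "TTTT", "GAAAAC", "ACGT"], 4, toy_score)
```

In this toy case two negatives outscore the test site (`TTTT` through its reverse strand and `GAAAAC` through its best window), so the rank is 3.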

In this study, the weight on nucleotide i, $w_i$, is set to
either 1 or its information content given in (2). Similarly,
the weight on a nucleotide pair, $w_{i,j}$, is set to either 1 or its
information content defined in (4). Fig. 4 shows the LOO
CV results as box plots without and with information
content, respectively. For a method utilizing the training
negatives, the best run over 10 runs is listed. Results on
the centroid and PSSM methods reported in [15] were
faithfully reproduced here. Moreover, from the box plots,
we can see that methods utilizing negative examples
perform better than methods considering only positive
examples.

To test whether the 2-centroid and ODV methods
produced lower average ranks than the centroid and PSSM
methods, we adopted the testing procedure used in [15].
The Wilcoxon signed-rank test [23] was performed on
four pairs of methods: (centroid, 2-centroid),
(PSSM, 2-centroid), (centroid, ODV) and (PSSM, ODV).
Multiple testing was corrected by the Holm-Bonferroni
method [24]. The testing was done for each of the 4
similarity measures, i.e., Sim and Sim2 in (1) and (3),
respectively, with or without weighting by information
content. Results showed that, at the 5% significance level,
the following relationships can be justified for each
similarity measure: 2-centroid → centroid, 2-centroid →
PSSM, ODV → centroid and ODV → PSSM, where →
denotes "has a lower average rank than". Fig. 5a and 5b
show the p-values of the tests on the 4 pairs of methods
without IC and with IC, respectively.
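
For readers unfamiliar with the Holm-Bonferroni correction, the step-down rule can be sketched as below. The function name is hypothetical and the p-values in the example are made up for illustration; they are not the values reported in Fig. 5.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected.

    The p-values are visited in increasing order; the k-th smallest (k = 0, 1, ...)
    is compared with alpha / (m - k), and testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break                      # all larger p-values are retained as well
    return reject

# Made-up p-values for four pairwise comparisons.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Here the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 exceeds 0.025, so testing stops and the remaining two hypotheses are retained.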

3.2 The 2-centroid method with a novel similarity 
measure 

Do and Wang [18] proposed a novel distance measure by
first transforming a sequence of length l into an (l - 1)-element
vector. To measure the distance between two
sequences s and t, t can be shifted to the left or to the
right (with penalty) to find the best alignment between s
and t. Since shifting is implicitly done in scoring a
non-binding site in our CV experiments, we use the distance
measure without considering shifting:

$$\mathrm{Dist}(s, t) = \sum_{i=1}^{l-1} |s_i - t_i|, \qquad (16)$$

where $\mathbf{s} = (s_1 \; s_2 \; \ldots \; s_{l-1})$ and $\mathbf{t} = (t_1 \; t_2 \; \ldots \; t_{l-1})$
are the sequences s and t embedded
in $\mathbb{R}^{l-1}$, respectively. One can see that this is essentially
the Manhattan distance between $\mathbf{s}$ and $\mathbf{t}$. To compute
the similarity between s and t, we take the negative
distance as the similarity.

This similarity measure is then used along with our
2-centroid method. Fig. 6 compares the performance of
the similarity measures Sim in (1) ($w_i = 1$, $\forall i$) and
Sim2 in (3) ($w_i = 1$, $\forall i$ and $w_{i,j} = 1$, $\forall i, j$) to the one
proposed in [18]. The TF's are ordered by their median
information content across the l nucleotides, i.e., the
median of $\{IC_i \mid i = 1, 2, \ldots, l\}$. A general trend can be
observed: the performance of a method improves
as the median information content increases. Looking at
individual TF's, we can see that the similarity measure
by Do and Wang gave the lowest average rank on TF
lrp, performed equally well on TF's hipB and trpR, but
produced the highest average ranks on all the other TF's.



3.3 Yet another LOO CV 

Two different sets of negative examples were used in
the LOO CV experiments presented above since no prior
knowledge of the test negatives was assumed. We now
show that, with the knowledge of non-binding sites,
a small representative set of negative examples can be
found by a slightly different LOO CV procedure. To
avoid ambiguity, we consistently refer to the sets defined in
Section 3.1.

Fig. 4. Box plots of average ranks of the 35 TF's. A box contains TF's with ranks falling between the 25th and 75th
percentiles, while the median is marked by the horizontal bar in it. The ends of the whiskers mark the minimum and
maximum of average ranks of all the TF's. A suffix "_P" in a name means that the similarity measure given in (3) or the
score in (14) is used. (a) Each nucleotide or nucleotide pair is given the same weight. (b) Each nucleotide or nucleotide
pair is weighted by its information content.




Fig. 5. Results of Wilcoxon signed-rank tests on 4 pairs of methods (a) without IC and (b) with IC. Arrows along with
p-values point from the superior method to the inferior one.




Fig. 6. Comparison of three similarity measures using the 2-centroid method. The horizontal axis lists the TF names.



Consider a particular TF with $n_+$ known TFBS's of
length l. Suppose that the goal is to search for sites to
which this TF binds but avoid known binding sites of
other TF's. That is, the binding sites of the other 34
TF's are assumed known. We first randomly sample a
representative set of $10 n_+$ l-mers, $N_{rep}$, from $N_{test}$, since
$10 n_+ \approx 0.05 |N_{train}|$. For each iteration of LOO CV, the
test TFBS is left out. A scoring function is obtained using
the $n_+ - 1$ training TFBS's and $N_{rep}$. The rank of the test
TFBS is then calculated based on its score and the scores
of the non-TFBS's in $N_{test}$. The average rank of this TF
is computed at the end of the LOO CV procedure. A
good representative set of $10 n_+$ negative examples can
be found by repeating this LOO CV procedure multiple
times.

Fig. 7. Box plots of average ranks of the 35 TF's. Each
nucleotide or nucleotide pair is weighted by its information
content.
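
The repeated sampling procedure described above can be sketched as follows. The sketch is illustrative only: `fit` and `rank` are placeholders for any scoring method and ranking function, the function names are hypothetical, the lowest average rank is assumed to be the selection criterion, and the sample size is capped at the number of available negatives so that small toy inputs still run.

```python
import random

def loo_average_rank(sites, train_negatives, test_negatives, fit, rank):
    """Leave each site out in turn, fit on the rest, and average the ranks."""
    ranks = []
    for i, test_site in enumerate(sites):
        training_sites = sites[:i] + sites[i + 1:]
        score_fn = fit(training_sites, train_negatives)
        ranks.append(rank(test_site, test_negatives, score_fn))
    return sum(ranks) / len(ranks)

def pick_representative_negatives(sites, test_negatives, fit, rank,
                                  repeats=32, seed=0):
    """Repeat LOO CV with fresh samples of 10 * n_+ negatives; keep the best set."""
    rng = random.Random(seed)
    size = min(10 * len(sites), len(test_negatives))   # cap is a guard for toy inputs
    best_avg, best_rep = None, None
    for _ in range(repeats):
        rep = rng.sample(test_negatives, size)
        avg = loo_average_rank(sites, rep, test_negatives, fit, rank)
        if best_avg is None or avg < best_avg:
            best_avg, best_rep = avg, rep
    return best_rep, best_avg
```

Here `fit(training_sites, negatives)` is expected to return a scoring function and `rank(test_site, test_negatives, score_fn)` an integer rank, mirroring the steps of the procedure.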

We sampled a representative set of negative examples
for each TF by repeating the LOO CV procedure 32
times. Fig. 7 compares the average ranks resulting from the
LOO CV procedure described in this section to those
obtained in the first set of LOO CV experiments. Results of
the first LOO CV procedure are marked with suffix "_1",
while those of the LOO CV experiments described in this
section are marked with suffix "_2". As expected, the
average ranks obtained from the second set of LOO CV
experiments are lower than or comparable to those obtained
from the first set. Looking at the medians of ODV_P_1
and ODV_P_2, it may appear that ODV_P_2 performed
worse than ODV_P_1. However, a statistical test [23]
suggests that overall ODV_P_2 has lower average ranks
than ODV_P_1 (p-value: 0.06975).

3.4 ULPB versus other methods 

Since the ungapped likelihood under positional
background method was evaluated by Salama and Stekel
[20] on a data set collected from RegulonDB, we
conducted LOO CV experiments using the second data set
described in Section 2.1. The methods compared to ULPB
include the position-specific probability matrix (PSPM)
method, the position-specific scoring matrix method
with nucleotide pairs (PSSM_P), the 2-centroid method
with nucleotide pairs (2-centroid_P) and the optimal
discriminating vector method with nucleotide pairs (ODV_P).
PSPM was chosen because it was one of the methods
compared in [20]. PSSM_P was included because it does
not require non-TFBS's and it is similar to ULPB in
that nucleotide pairs are considered. ODV_P and
2-centroid_P were compared because they employ
non-TFBS's explicitly. Information content was not used in
any of the methods compared in this section.

Fig. 8. Box plots of average ranks of the 26 TF's in the
second data set.

The methods were evaluated under the same LOO
CV framework described in Section 3.1. Overall
performance of the compared methods is summarized in
Fig. 8. The box plots show that overall PSPM gave
the highest average ranks, which is consistent with the
results reported in [20] that ULPB performed better than
PSPM. In terms of the median, marked by the horizontal bar
inside a box, ULPB appears to be worse than PSSM_P,
2-centroid_P and ODV_P. Fig. 9 shows performance of the
4 methods on individual TF's. We can see that PSSM_P
performed better than ULPB on 15 out of 26 TF's and
2-centroid_P/ODV_P performed better than ULPB on
14 out of 26 TF's. To gauge the significance of these
observations, statistical tests [23] were performed on
all the 6 pairs of methods. The results, however, only
support that 2-centroid_P outperformed PSPM (p-value:
0.000722), ODV_P outperformed PSPM (p-value: 0.03344)
and PSSM_P outperformed PSPM (p-value: 0.006476).
The p-values of the other tests are all greater than 5%, the
usual significance cut-off. Similar to Fig. 6, the relation
between performance and median information content
can be observed as well.

4 Discussion 

4.1 No best method for all TF's 

We have shown in the previous section that, overall,
methods utilizing negative examples perform better than
methods using only positive examples. One may be
tempted to identify the method that gives the lowest
average rank for all the TF's. From the results of our
LOO CV experiments, however, we found that there is
no combination of method and similarity measure that
is optimal for all the TF's in the data sets. That is,
introducing nucleotide pairs in similarity computation
or incorporating non-binding sites lowers the average
ranks for most of the TF's but increases the average ranks
for a few of them. Fig. 6 serves as an example. It shows
that the similarity measure proposed in [18] gives the
highest average ranks for most of the TF's but is the
best one among the three measures for TF lrp when the
2-centroid method is used. It also shows that Sim2 yields
lower average ranks than Sim except for a few TF's such
as cytR and fur when used along with the 2-centroid
method. Therefore, instead of seeking the combination
of similarity measure and method that is optimal for all
the TF's, it is more reasonable and practical to search for
the best combination of similarity measure and method
for a particular TF of interest, which can be achieved by
CV experiments.

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

Fig. 9. Comparison of the PSPM, PSSM_P, ULPB, 2-centroid_P and ODV_P methods using the second data set.



4.2 Complexity of transcription factor binding sites 

Results presented in Fig. 6 and 9 indicate correlation
between the "complexity" of a TF and its median
information content across nucleotides. Therefore, we
attempted to establish the relationship between average
rank and three factors: the length, number of known
TFBS's and median information content. The average
ranks on the second data set produced by 2-centroid_P
in Fig. 9 were linearly regressed [25] on the three factors.
Aside from the intercept, only the median information
content was found significant (p-value: $2.89 \times 10^{-7}$). A
simple linear regression was then performed to obtain
the linear relationship between average rank and median
information content. Fig. 10 shows a scatter plot of
average rank versus median information content for the
26 TF's in the second data set. The straight line represents
the relationship between average rank and median
information content found by simple linear regression. The
median information content can be viewed as a measure
of conservedness of binding sites of a TF. This reasonably
implies that the binding sites of a TF are easier to predict
when they are more conserved.

Fig. 10. Linear relationship between average rank and
median information content. The average ranks were
obtained by running 2-centroid_P without weighting by
information content on the second data set.
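
To make the notion of median information content concrete, the sketch below computes the per-position information content of a set of aligned binding sites and takes its median. It is only a sketch under stated assumptions: a uniform background (so that the maximum is 2 bits per position) and an optional pseudocount, neither of which is necessarily the exact definition used in this work. The function names are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence

ALPHABET = "ACGT"


def position_information_content(sites: Sequence[str],
                                 pseudocount: float = 0.0) -> List[float]:
    """Per-position information content in bits, assuming a uniform background."""
    length = len(sites[0])
    ics = []
    for i in range(length):
        counts = {u: pseudocount for u in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        ic = 2.0  # log2(4), the maximum under a uniform background
        for u in ALPHABET:
            p = counts[u] / total
            if p > 0:
                ic += p * math.log2(p)  # subtracts the entropy of the column
        ics.append(ic)
    return ics


def median_information_content(sites: Sequence[str],
                               pseudocount: float = 0.0) -> float:
    """Median of the per-position information content."""
    return median(position_information_content(sites, pseudocount))
```

Under these assumptions, perfectly conserved sites such as `["ACGT", "ACGT"]` give 2 bits at every position, whereas pooling two different but individually conserved patterns such as `["AAAA", "CCCC"]` gives only 1 bit per position.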

4.3 Properties of Investigated Methods 

To reveal properties of methods, we performed pair-wise
comparisons on some of the methods investigated
in this work. Fig. 11 shows the pair-wise comparisons
of centroid_P, PSSM_P, 2-centroid_P and ODV_P with
information content on the first data set. For each pair
of methods, the 35 TF's were divided into two groups
according to which method of the pair performs better.
We then looked for statistical difference between the two
groups in terms of three factors, that is, the number of
known TFBS's, the median IC and the length of binding sites.
The comparison between centroid_P and PSSM_P indicates
that PSSM_P performs better than centroid_P on
21 TF's, i.e., there are 21 TF's in one group and 14 TF's
in the other. Moreover, when PSSM_P performs better,
the median IC of a TF is on average 1.10095, which is
significantly (p-value < 5%) greater than 0.74928, the
average median IC of a TF when centroid_P performs
better. Similar interpretations lead to additional comments
as follows. 2-centroid_P requires significantly fewer
known TFBS's than PSSM_P. ODV_P performs better
than PSSM_P or 2-centroid_P when a TF has higher
median IC and shorter binding sites.

Fig. 11. Pair-wise comparisons of centroid_P, PSSM_P,
2-centroid_P and ODV_P with information content on the
first data set of 35 TF's. Three factors except for # TF's
are tested for statistical significance. Significant factors
are marked by striped bars.

Fig. 12. Pair-wise comparisons of ODV_P, 2-centroid_P,
PSSM_P and ULPB without information content on the
second data set of 26 TF's. Three factors except for # TF's
are tested for statistical significance. Significant factors
are marked by striped bars.

Comparisons were also made between the four comparable
methods, ODV_P, 2-centroid_P, PSSM_P and
ULPB, on the second data set of 26 TF's. Fig. 12 shows
the bar plots. The plots suggest that 2-centroid_P performs
better than PSSM_P when a TF has higher median
IC and shorter binding sites, that 2-centroid_P performs
better than ODV_P when a TF has more known TFBS's,
that ODV_P outperforms ULPB when a TF has fewer known
TFBS's and higher median IC, and that ODV_P performs
better than PSSM_P when a TF has fewer known TFBS's.
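
The grouping procedure behind these pair-wise comparisons can be sketched as follows. This is an illustrative sketch only: the statistical test shown here is a two-sided permutation test on the difference of group means, which is a stand-in chosen for simplicity and is not claimed to be the test used in this work. The function names and the dictionary layout (TF name mapped to average rank) are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def split_by_winner(rank_a: Dict[str, float],
                    rank_b: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """TFs on which method A (resp. B) has the strictly lower average rank.

    Ties are dropped (an assumption of this sketch).
    """
    a_better = [tf for tf in rank_a if rank_a[tf] < rank_b[tf]]
    b_better = [tf for tf in rank_a if rank_b[tf] < rank_a[tf]]
    return a_better, b_better


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def permutation_p_value(x: List[float], y: List[float],
                        n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(_mean(x) - _mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:len(x)]) - _mean(pooled[len(x):]))
        if diff >= observed - 1e-12:  # small tolerance for rounding
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

For a given factor, e.g. median IC, `x` and `y` would hold the factor values of the TFs in the two groups returned by `split_by_winner`.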

From the observations above, we can see that methods
utilizing negative examples tend to perform better on
TF's with higher median information content. This
suggests that the proposed 2-centroid and ODV methods
are well-suited for identifying eukaryotic transcription
factor binding sites. Fig. 13 shows the distribution of
median IC of 459 eukaryotic transcription factors in the
JASPAR database [26], where 75% (344 out of 459) of
the TF's have median IC above 1.02. According to our
analysis shown in Fig. 11 and 12, the 2-centroid and ODV
methods perform significantly better than other compared
methods when a TF has relatively high median
IC.

Moreover, properties revealed in Fig. 11 and 12 can
potentially help improve our 2-centroid and ODV methods.
We can see in Fig. 10 that the median information content
of a TF can be as low as 0.05. We suspect that the motif
of such a TF is actually a mixture of two or more motif
subtypes, which contributes to its low median IC. We
expect the motif subtypes of a TF to have higher median
IC. Thus, a method can first identify motif subtypes
contained in the known TFBS's of a TF and then search
for individual subtypes.

4.4 Motif Subtypes Improve the 2-centroid Method 

It has been shown that the binding sites of a TF can
be better represented by 2 motif subtypes than by a
single motif [27], [28]. In search for new binding sites,
two position-specific scoring matrices are used to score
an l-mer and the higher score of the two is assigned
to this l-mer. Searching with two PSSM's was shown to
be superior to searching with a single PSSM by
cross-species conservation statistics in these studies.

Fig. 13. Distribution of median IC of 459 eukaryotic
transcription factors in the JASPAR database.

To validate our hypothesis proposed in Section 4.3, we
coupled motif subtypes with the centroid method as well
as the 2-centroid method. Our approach to motif subtype
identification is slightly different from those in previous
work [27], [28], while the idea is similar. As usual, all
the l-mers were first embedded in the Euclidean space
as described in Section 2.2. The known binding sites of
a TF were clustered into two subtypes by the k-means
algorithm [29]. The centroids of these two subtypes, $\mu_{+1}$
and $\mu_{+2}$, were then computed. The centroid method
coupled with motif subtypes is denoted by centroid_C
and it scores an l-mer t by

\[ \max\{\mu_{+1}^T \mathbf{t},\ \mu_{+2}^T \mathbf{t}\}, \]

where $\mathbf{t}$ denotes the l-mer t embedded in the Euclidean
space. On the other hand, the 2-centroid method coupled
with motif subtypes is denoted by 2-centroid_C and it
scores an l-mer t by

\[ \max\{(\mu_{+1} - \mu_{-})^T \mathbf{t},\ (\mu_{+2} - \mu_{-})^T \mathbf{t}\}, \]

where $\mu_{-}$ is the centroid of the non-binding sites.
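
The two scoring rules above can be sketched in a few lines of Python. This is a minimal sketch under several assumptions that are not taken from this work: l-mers are embedded by plain one-hot encoding of single nucleotides (no nucleotide pairs, no information-content weighting, identity similarity), k-means is a bare-bones Lloyd iteration with a deterministic start, and the 2-centroid score is taken to be the inner product with the difference of the positive and negative centroids. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

ALPHABET = "ACGT"
Vector = List[float]


def embed(lmer: str) -> Vector:
    """One-hot embedding: 4 dummy variables per position (assumed embedding)."""
    return [1.0 if u == c else 0.0 for c in lmer for u in ALPHABET]


def centroid(vectors: Sequence[Vector]) -> Vector:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(vectors: Sequence[Vector], n_iter: int = 20) -> Tuple[Vector, Vector]:
    """Lloyd's k-means with k = 2, started from the first vector and the
    vector farthest from it (a deterministic choice made for this sketch)."""
    c1 = list(vectors[0])
    c2 = list(max(vectors, key=lambda v: sq_dist(v, c1)))
    for _ in range(n_iter):
        g1 = [v for v in vectors if sq_dist(v, c1) <= sq_dist(v, c2)]
        g2 = [v for v in vectors if sq_dist(v, c1) > sq_dist(v, c2)]
        if not g1 or not g2:
            break  # degenerate split: keep the current centroids
        new1, new2 = centroid(g1), centroid(g2)
        if new1 == c1 and new2 == c2:
            break  # converged
        c1, c2 = new1, new2
    return c1, c2


def score_centroid_c(mu1: Vector, mu2: Vector, lmer: str) -> float:
    """centroid_C: the larger of the two subtype-centroid inner products."""
    t = embed(lmer)
    return max(dot(mu1, t), dot(mu2, t))


def score_2centroid_c(mu1: Vector, mu2: Vector, mu_neg: Vector, lmer: str) -> float:
    """2-centroid_C: each subtype centroid is first offset by the negative centroid."""
    t = embed(lmer)
    return max(dot(sub(mu1, mu_neg), t), dot(sub(mu2, mu_neg), t))
```

Here `mu1` and `mu2` would come from `two_means` applied to the embedded known binding sites, and `mu_neg` from `centroid` applied to the embedded non-binding sites.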

We assessed and compared centroid_C and 2-centroid_C
to their counterparts without motif subtypes
by leave-one-out cross-validation on the second data
set of 26 TF's. Results summarized as box plots are
shown in Fig. 14, where Pair denotes the use of
nucleotide pairs and IC indicates weighting nucleotides
and nucleotide pairs with information content. In all
the four cases, significant improvement was observed
when motif subtypes were taken into account. Table 3
elucidates the impact of motif subtype identification on
our 2-centroid method. The first column shows that,
before introducing motif subtypes, the improvement of
2-centroid over centroid is only statistically significant
in the first row. The second column displays significant
improvement of centroid_C over centroid, which was
anticipated and consistent with the results reported in
[27], [28]. The third column shows significant improvement
of 2-centroid_C over 2-centroid in all four cases.
We observed that the improvement of 2-centroid_C over
2-centroid is always more significant than the improvement
of centroid_C over centroid. This implies that our
2-centroid method benefited even more from the
identification of motif subtypes. The last column indicates that,
after the introduction of motif subtypes, 2-centroid_C
significantly outperforms centroid_C in all cases. These
results confirmed our hypothesis that, for TF's with low
median IC, methods employing non-binding sites should
be coupled with motif subtype identification.

Fig. 15 illustrates the application of 2-centroid_C with
nucleotide pairs to transcription factor FlhDC in the
second data set. It can be seen in Fig. 15a that the
information content of FlhDC is low at all the 16 positions.
After motif subtype identification, the two subtypes
display distinct patterns and the information content
of the two subtypes was greatly improved as seen in
Fig. 15b. Fig. 15c shows a scatter plot of binding sites,
non-binding sites and their respective centroids, while
Fig. 15d shows a scatter plot of binding sites belonging
to two subtypes, non-binding sites and their respective
centroids after motif subtype identification. Many
binding sites are not distinguishable from non-binding sites
in Fig. 15c. However, after motif subtype identification,
TFBS's became separable from non-TFBS's as seen in
Fig. 15d, resulting in a 1.7-fold improvement in average
rank.



4.5 Connection between ODV and PSSM/ULPB 

Finally, we elucidate the relation between ODV and
PSSM/ULPB. We first derive the connection between the
optimal discriminating vector method and the
position-specific scoring matrix method. Without loss of
generality, we do not include nucleotide pairs in the derivation
for simplicity. We abuse notation for a moment
and let $\beta_i(A) = \beta_{4i-3}$, $\beta_i(C) = \beta_{4i-2}$,
$\beta_i(G) = \beta_{4i-1}$ and $\beta_i(T) = \beta_{4i}$.
The ODV score $\beta^T \mathbf{t}$ then becomes

\[ \beta^T \mathbf{t} = \sum_{i=1}^{l} \beta_i(t_i) = \sum_{i=1}^{l} w_i \log \frac{f_i(t_i)\, k_i}{f(t_i)}, \qquad (17) \]
Fig. 14. Box plots showing the LOO CV results of methods centroid, centroid_C, 2-centroid and 2-centroid_C. Pair
denotes the use of nucleotide pairs and IC indicates weighting nucleotides and nucleotide pairs with information
content.

TABLE 3

Improvement by Identifying Motif Subtypes

                2-centroid -> centroid      centroid_C^e -> centroid    2-centroid_C -> 2-centroid  2-centroid_C -> centroid_C
Pair^a   IC^b   # better^c p-value^d        # better   p-value          # better   p-value          # better   p-value
□        □      19         2.793 x 10^-2    18         5.093 x 10^-3    21         2.205 x 10^-5    21         1.205 x 10^-3
□        ■      18         5.037 x 10^-2    19         3.727 x 10^-4    22         1.135 x 10^-5    19         5.983 x 10^-3
■        □      17         9.937 x 10^-2    16         3.757 x 10^-2    23         6.661 x 10^-6    18         2.806 x 10^-3
■        ■      17         1.185 x 10^-1    17         7.003 x 10^-3    20         2.325 x 10^-4    19         8.807 x 10^-3

a Whether a method uses nucleotide pairs.
b Whether a method weights nucleotides and nucleotide pairs with information content.
c The number of TFs supporting the relationship being tested.
d p-value of the relationship produced by a statistical test [23].
e Suffix _C denotes coupling a method with motif subtypes.



where $f_i(t_i) = \frac{1}{k_i} \exp\left(\frac{\beta_i(t_i)}{w_i}\right) f(t_i)$ is the position-specific
nucleotide frequency for $t_i$ induced by $\beta_i(\cdot)$ and

\[ k_i = \sum_{u \in \{A, C, G, T\}} \exp\left(\frac{\beta_i(u)}{w_i}\right) f(u) > 0 \]

is a scaling factor for position i since ODV does not
impose the constraints $\sum_{u \in \{A, C, G, T\}} f_i(u) = 1$. From
(17), we note that $\sum_{i=1}^{l} w_i \log k_i$ does not depend on
$\mathbf{t}$ and thus $\beta$ is optimal if and only if
$\{f_i(u) \mid u \in \{A, C, G, T\} \text{ and } i = 1, 2, \ldots, l\}$ is optimal. Therefore,
an optimal PSSM can be obtained from our ODV
method.

The ungapped likelihood under positional
background (ULPB) method is similar to the PSSM_P method
in that both methods score nucleotides and nucleotide
pairs. The ULPB method scores an l-mer s by looking
at the first nucleotide $s_1$ and all the $l - 1$ adjacent
nucleotide pairs $s_1 s_2, s_2 s_3, \ldots, s_{l-1} s_l$. Therefore, we can
embed s in $\mathbb{R}^{20l-16}$ by transforming $s_1$ into 4 dummy
variables and each of the $l - 1$ pairs into 16 dummy
variables as described in Section 2.2. An optimal
discriminating vector $\beta \in \mathbb{R}^{20l-16}$ can then be found
by applying our ODV method described in Section 2.3.
Following similar arguments, we can see that there is a
one-to-one correspondence between elements of $\beta$ and
$\{f_i(u), f_i(v \mid u) \mid u, v \in \{A, C, G, T\} \text{ and } i = 1, 2, \ldots, l - 1\}$
in (15). Hence, an optimal ULPB can also be obtained
from our ODV method.
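
The conversion from a discriminating vector to position-specific frequencies in the PSSM case can be checked numerically. The sketch below is illustrative only and assumes strictly positive weights $w_i$, a background distribution given as a dictionary, and the 0-based storage of $\beta_i(u)$ at index 4i plus the position of u in "ACGT". The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

ALPHABET = "ACGT"


def odv_to_pssm(beta: List[float], weights: List[float],
                background: Dict[str, float]) -> Tuple[List[Dict[str, float]], List[float]]:
    """Return (freqs, scales) with freqs[i][u] = f_i(u) and scales[i] = k_i."""
    freqs, scales = [], []
    for i, w in enumerate(weights):  # each w is assumed to be > 0
        unnorm = {u: math.exp(beta[4 * i + j] / w) * background[u]
                  for j, u in enumerate(ALPHABET)}
        k_i = sum(unnorm.values())  # scaling factor for position i
        freqs.append({u: v / k_i for u, v in unnorm.items()})
        scales.append(k_i)
    return freqs, scales


def odv_score(beta: List[float], lmer: str) -> float:
    """Inner product of beta with the one-hot embedding of lmer."""
    return sum(beta[4 * i + ALPHABET.index(c)] for i, c in enumerate(lmer))


def pssm_score(freqs: List[Dict[str, float]], weights: List[float],
               background: Dict[str, float], lmer: str) -> float:
    """Weighted log-odds score of lmer under the induced frequencies."""
    return sum(w * math.log(f[c] / background[c])
               for f, w, c in zip(freqs, weights, lmer))


def scale_offset(weights: List[float], scales: List[float]) -> float:
    """The term sum_i w_i log k_i, which does not depend on the l-mer."""
    return sum(w * math.log(k) for w, k in zip(weights, scales))
```

For any l-mer, `odv_score` equals `pssm_score` plus `scale_offset`, so both scores rank l-mers identically, which is the property the derivation relies on.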

One direct implication of the connection established
above is that a vector obtained by the centroid, 2-centroid
or ODV methods can be compared to a PSSM model
in the same framework. As an example, Fig. 16 shows
two sequence logos [31] of TF MalT in the second
data set. The top logo represents the signature of the
known binding sites, while the bottom one is obtained
by converting the centroid $\mu_+$ to a PSSM model as
in (17) with $\beta = \mu_+$. The two logos display distinct
patterns of the two methods, implying difference in
performance. The PSSM method gave an average rank of
233.9, while the centroid method gave an average rank
of 69.8. Clearly, the performance difference lies in the
difference between the two logos. We can see that the
two logos are very different at positions 3, 5, 6 and
10. Position 3 indicates that down-weighting letter T
results in better performance. Position 10 shows that
the influence of letter A is underestimated in the PSSM
model. Other positions can be similarly compared and
interpreted as well.

Fig. 15. Illustration of the 2-centroid_C method with nucleotide pairs on transcription factor FlhDC in the second data
set. Axes in (c) and (d) were found by Fisher's discriminant analysis [30]. (a) Sequence logo before motif subtype
identification. (b) Sequence logos of two motif subtypes identified by k-means clustering. (c) Scatter plot of binding
sites, non-binding sites and their respective centroids, $\mu_+$ and $\mu_-$. The solid arrow identifies the vector $\mu_+$, while the
dashed arrow denotes the vector $\mu_+ - \mu_-$. (d) Scatter plot of two clusters of binding sites, non-binding sites and their
respective centroids, $\mu_{+1}$, $\mu_{+2}$ and $\mu_-$. The two solid arrows represent vectors $\mu_{+1}$ and $\mu_{+2}$, while the two dashed
arrows denote vectors $\mu_{+1} - \mu_-$ and $\mu_{+2} - \mu_-$.

5 Conclusion 

In this work, we investigated the use of negative
examples in the TFBS search problem. To utilize
negative examples, we proposed the 2-centroid and ODV
methods, which are natural extensions of the centroid
method. The proposed methods were compared to
state-of-the-art methods relying purely on positive examples
as well as a method considering negative examples.
Comprehensive LOO CV results showed that non-TFBS's
are indeed helpful for TFBS search. The large number
of non-binding sites can be significantly reduced by
sampling a small representative set by LOO CV.

Not surprisingly, there is no single best TFBS search
method or similarity measure for all the TF's. The best
combination of similarity measure and search method
can be found for a particular TF by CV experiments.
Nevertheless, pair-wise comparisons between methods
revealed interesting properties of methods compared in
this work. In particular, we showed that the 2-centroid
and ODV methods are significantly better than the other
methods when a TF has relatively high median
information content. Even for TF's with low median information
content, preceded by motif subtype identification, the
2-centroid method was shown to be effective in searching
for binding sites belonging to individual subtypes. The
ODV method can be easily coupled with motif subtype
identification as well and we believe significant
improvement can be expected.

Fig. 16. Two sequence logos of TF MalT. Top: PSSM;
Bottom: centroid.

All the experiments in this work were conducted on
prokaryotic transcription factors, i.e., TF's in the E. coli
K-12 genome. We claim that the proposed 2-centroid and
ODV methods are well-suited for eukaryotic transcription
factor binding site search as well. This is based on
characteristics of the proposed methods and summary statistics
of 459 eukaryotic transcription factors in the JASPAR
database. Finally, we derived the connection between our
ODV method and the PSSM method, showing that an
optimal vector in ODV implies an optimal scoring matrix
in PSSM and vice versa. By properly embedding an l-mer in
a Euclidean space, the same connection between ODV
and ULPB can be established as well.

The effects of negative examples on eukaryotic
transcription factor binding site search will be investigated.
Our future work also aims to extend our proposed
methods to handle known binding sites of variable
lengths. We will seek to approach this problem without
resorting to multiple sequence alignment, which is
notoriously time-consuming. In the meantime, we will also
seek to identify better similarity measures than those
investigated in this study.